Unveiling the sound of the cognitive status: Machine Learning-based speech analysis in the Alzheimer’s disease spectrum

Background Advancement in screening tools accessible to the general population for the early detection of Alzheimer’s disease (AD) and prediction of its progression is essential for achieving timely therapeutic interventions and conducting decentralized clinical trials. This study delves into the application of Machine Learning (ML) techniques by leveraging paralinguistic features extracted directly from a brief spontaneous speech (SS) protocol. We aimed to explore the capability of ML techniques to discriminate between different degrees of cognitive impairment based on SS. Furthermore, for the first time, this study investigates the relationship between paralinguistic features from SS and cognitive function within the AD spectrum. Methods Physical-acoustic features were extracted from voice recordings of patients evaluated in a memory unit who underwent a SS protocol. We implemented several ML models evaluated via cross-validation to identify individuals without cognitive impairment (subjective cognitive decline, SCD), with mild cognitive impairment (MCI), and with dementia due to AD (ADD). In addition, we established models capable of predicting cognitive domain performance based on a comprehensive neuropsychological battery from Fundació Ace (NBACE) using SS-derived information. Results The results of this study showed that, based on a paralinguistic analysis of sound, it is possible to identify individuals with ADD (F1 = 0.92) and MCI (F1 = 0.84). Furthermore, our models, based on physical acoustic information, exhibited correlations greater than 0.5 for predicting the cognitive domains of attention, memory, executive functions, language, and visuospatial ability. Conclusions In this study, we show the potential of a brief and cost-effective SS protocol in distinguishing between different degrees of cognitive impairment and forecasting performance in cognitive domains commonly affected within the AD spectrum. Our results demonstrate a high correspondence with protocols traditionally used to assess cognitive function. Overall, it opens up novel prospects for developing screening tools and remote disease monitoring. Supplementary information The online version contains supplementary material available at 10.1186/s13195-024-01394-y.


Introduction
The digitization in the healthcare field experienced in the last few years has led to the emergence of new technologies with great potential for Alzheimer's disease (AD) management [1].This neurodegenerative disease is the most frequent type of dementia worldwide.With no effective treatment yet available, AD poses a significant challenge to the sustainability of existing healthcare systems [2].As a result, the development of tools capable of detecting the disease in its early stages has become one of the most active research areas [3].
From a neuropsychological standpoint, AD typically manifests with impairments in memory, attention, language, executive and visuospatial functions, and behavior [2].These alterations worsen as the disease progresses, eventually limiting the individual's independence.At the pathophysiological level, AD is characterized by a cascade of brain-level events, including the accumulation of amyloid-β plaques (Aβ ), the formation of hyperphosphorylated tangles of tau protein (p-tau), and neuroinflammation [2,4].These neuropathological signatures ultimately lead to atrophy, decreased brain metabolism, and disruptions in brain connectivity, which cause the observed cognitive alterations [3][4][5].
In this context, it is well known that the entire landscape of neurological events in AD initiates several years before the onset of clinical symptoms and progresses silently until the first cognitive changes emerge [6].Consequently, numerous diagnostic criteria have been proposed, incorporating information from biomarkers such as A β and p-tau, measured by positron emission tomog- raphy (PET) or detected on cerebrospinal fluid (CSF), along with evidence of neurodegeneration assessed using magnetic resonance imaging (MRI) [4].However, the detection of biomarkers through neuroimaging or lumbar puncture requires specialized equipment and trained personnel, rendering these procedures expensive and inaccessible for most healthcare centers.Furthermore, most patients consult a memory unit when their cognitive impairment is already evident and only a minority when the initial neuropsychological symptoms arise [2].Collectively, these factors indicate that while these techniques have significant value for disease diagnosis, their use as population screening tools to identify individuals at risk of developing AD is limited.
Conversely, neuropsychological batteries have traditionally served as the initial assessment protocol for suspected cognitive impairment associated with AD [7,8].
Nevertheless, these batteries still rely on the presence of clinicians in a specialized memory unit, making it a timeconsuming procedure.Recognizing these limitations, numerous online abbreviated protocols have been proposed [1,9].Among these, neuropsychological assessment through the spontaneous speech (SS) analysis represents one of the most promising approaches [10].Previous research supports the presence of language deficits in AD patients several years before the progression to the dementia stage, rendering it a valuable tool for detecting individuals in the early stages of mild cognitive impairment (MCI) [11,12].Additionally, language abilities exhibit associations with other cognitive domains such as memory, attention, and executive functions, suggesting that SS analysis has the potential to offer an approximate representation of an individual's cognitive performance [12].
Moreover, the increasing interest in speech analysis using voice recordings stems from the versatility and abundance of information that can be extracted from this type of data.By employing modern natural language processing (NLP) techniques applied to automatic transcriptions [13] or based on the information derived from the raw waveform [14], it is possible to obtain information of great interest for assessing the cognitive performance of an individual.Within this wealth of information, the physical-acoustic features drawn directly from the raw sound represent an agnostic, standardized, and widely available resource [15,16].These features encompass parameters such as formants, pitch, and prosody, which are affected in numerous neurodegenerative diseases and affective disorders [15].
To date, most studies analyzing SS using Machine Learning (ML) techniques have focused on developing diagnostic tools to differentiate clinical phenotypes within the AD continuum.For instance, different research groups have shown that it is possible to obtain accuracies ranging from 80 to 90% for discriminating between healthy controls (HC) and AD dementia (ADD) using speech-derived information [17][18][19][20][21][22].Other authors have extended their efforts beyond distinguishing ADD and HC, aiming to identify individuals in the early stages of MCI.In this context, results vary considerably based on the technique or sample used, with accuracies from 65 to 80% [18,21,[23][24][25].In contrast, very few authors have focused on addressing the relationship between SS and other aspects of the disease [26,27].For example, the investigation of the connection between SS and neuropsychological impairment has received limited attention [10,28].This relevant aspect is not widely available since demonstrating that SS can be a reliable proxy for an individual's cognition requires its evaluation against standardized neuropsychological measures.However, most research has only utilized the Mini-Mental State Examination (MMSE) to explore the association of SS with cognitive status [29][30][31], typically within the context of the ADReSS challenge [32].Moreover, the sample sizes employed, usually consisting of a few hundred subjects, severely restrict the ability to draw conclusive findings regarding the expected performance of the models [28].
The present study aims to extend previous research investigating how information obtained from a paralinguistic analysis of various SS tests can differentiate among clinical phenotypes and predict cognitive performance.For this purpose, firstly, we applied ML techniques to differentiate individuals with preserved cognition (subjective cognitive decline, SCD), patients with MCI, and those with ADD.Secondly, we developed models to predict changes in cognitive performance based on SS over the neuropsychological domains of memory, attention, visuospatial and executive functions, and language.As input variables for the models, we focused on information extracted from SS using standardized physical acoustic features obtained from the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) [15].Compared to previous works, our study utilizes a significantly larger population from a real-world setting, comprising SCD individuals, and patients with MCI, and ADD.In addition, as a novelty, our research delves into the connection between speech and changes in cognitive domain performance across the AD spectrum using ML and physical-acoustic features.

Study participants
This study comprised 1500 individuals who underwent evaluation at the Memory Clinic of Ace Alzheimer Center Barcelona (Ace) (single site) between March 2022 and April 2023.The participants were referred to the Memory Clinic either by their General Health practitioner due to subjective cognitive complaints, or they attended the Open House Initiative without a previous referral from a physician [33].All subjects completed neurological, neuropsychological, and social evaluations at Ace.The cognitive assessment included the Spanish version of the Mini-Mental State Examination (MMSE) [34], the memory test of the Spanish version of the 7 Minute screening neurocognitive battery [35], the short form of the Neuropsychiatric Inventory Questionnaire (NPI-Q) [36], the Hachinski's Ischemia Scale [37], the Blessed Dementia Rating Scale [38], the Clinical Dementia Rating (CDR) [39], and the complete Neuropsychological Battery of Fundació Ace (NBACE) [7].The final diagnosis for each participant was determined through consensus by a multidisciplinary team, including neurologists, neuropsychologists, and social workers, at a consensus diagnostic conference [40].

Ethical considerations
This study and its informed consent were approved by the ethics committees of the Hospital Universitari de Bellvitge (Barcelona) (ref.PR007/22) under Spanish biomedical laws (Law 14/2007, 3 July, regarding biomedical research; Royal Decree 1716/2011, 18 November) and followed the recommendations of the Declaration of Helsinki.All participants signed an informed consent for the SS protocol.

Acquisition and preprocessing of speech data
The SS protocol was carried out using the acceX ible platform on a tablet.The assessments were conducted in Spanish within a calm and controlled environment.Initially, the participants were presented The Cookie Theft Picture and were instructed to provide a comprehensive description of the image [44].Subsequently, they were given one minute to name as many different animals as possible.The participants' voices were recorded during the administration of these two tests, and the collected data was utilized for further analyses.The average duration of the protocol was 109.07 ± 15.54 s, with a minimum and maximum duration of 74.6 and 158.9 s respectively.The audio recordings were standardized to a frequency of 16 KHz, removing initial and final silent portions and applying the noise reduction model presented in [45] to eliminate potential environmental artifacts.The resulting audio data were manually reviewed.Then, features were extracted from the standardized feature set eGeMAPS [15] using the Python interface of the opensource toolkit OpenSMile [46].The set of features from the eGeMAPS are oriented to provide a simplified and standardized selection of relevant acoustic parameters for detecting physiological changes in voice production guided by findings of previous related studies [15].Based on prior research [19,20], we adopted the default configuration provided by the OpenSMILE library [15,46].The variables were calculated from the eighteen low level descriptors using a symmetric moving average filter of three frames long.For a more comprehensive understanding of the computation of the paralinguistic variables, readers are directed to [15].In total 176 variables were extracted from the voice recordings.

Calculation of cognitive composites
Neuropsychological composites created from the NBACE battery were used to determine the cognitive status of participants.Composite scores are a widely employed approach to capture common factors of variance across different neuropsychological tests.Their purpose is to simplify the information by eliminating redundancy between variables offering a more comprehensive characterization of the cognitive domain being assessed [47].In this study, five cognitive domains, typically examined in AD [48,49], were considered: memory, attention, visuospatial functions, executive functions, and language.Structural equation models (SEM) were used to calculate the composite scores based on the neuropsychological structure described in [7], and defined by an expert panel of neuropsychologists.
Briefly, the SEM framework defines a measurement model in which observed items y ∈ R i (e.g., i neuropsy- chological tests) are determined by unobservable factors η ∈ R p (e.g., p higher cognitive functions) according to y = υ + • η + ǫ , where υ ∈ R i corresponds to the intercepts of the regression paths, ∈ R i×p to the measurement slopes, and ǫ ∈ R i to the residual error.Additionally, the measurement model is subjected to a structural part that defines the relationship between observable and latent variables η = α + B • η + ζ , where α ∈ R p is a parameter vector, B ∈ R p×p is a non-singu- lar matrix with diag(B) = 0 indicating the relationship between latent variables, and ζ ∈ R p represents the latent variable residuals.
In the present study, the model parameters were adjusted using the robust maximum likelihood estimator, and the variances of the latent variables were fixed to 1 for model identification (unit variance method [47]).The model coefficients were estimated considering the baseline evaluations of the entire Ace database (N = 23,987)-including individuals HC/SCD, MCI, and ADD-using the R package lavaan [50].The composite scores were adjusted for age, sex, and years of formal education effects using linear regression models.Further details on the composites and their calculation can be found in the Appendix A.
The memory composite was created considering the variables long-term and recognition memory of the Word List subtest from the Wechsler Memory Scale, third version (WMS-III) [51].The attention composite included the Digit Forward and Digit Backwards from the Wechsler Adult Intelligence Scale, Third Edition (WAIS-III) [52].To define the visuospatial functions, the 15-Objects Test [53], the Poppelreuter-type overlap figures [54], and the Luria's Clock test [55] were considered.The executive functions were calculated from the Phonetic and Semantic Verbal fluencies [56,57] and the Automatic Inhibition subtest of the Syndrom Kurtz Test (SKT) [58].Finally, language function included the abbreviated 15-item naming test from the Boston Naming Test (BNT) [59] and the Verbal Comprehension and Repetitions [7].

Machine Learning modeling
In the present study, two different problems were addressed.Firstly, classification models were developed to differentiate between clinical phenotypes.Secondly, regression models were implemented to predict the cognitive composites outlined in the "Calculation of cognitive composites" section.The models used in each problem are described below.
For the classification problems, the following models were used: random forest (RF), extreme gradient boosting (XGB), support vector machine (SVM), and k-nearest neighbor (KNN).Due to the high dimensionality of the input data, the SVM and KNN algorithms were combined with a previous feature selection step.The feature selection aims to identify the optimal combination of variables by eliminating those that are irrelevant/redundant.We performed feature selection using a wrapper-based approach involving two sequential stages: candidate subset generation (SG) and subset evaluation (SE) [60].For the SG component, we utilized genetic algorithms (GA), a population-based metaheuristic optimization strategy inspired by the natural selection process [61].For the SE, we considered the mean balanced accuracy ( BA = 0.5 • [sensitivity + specificity] ) obtained through fivefold cross-validation (CV) from predictions derived from a Gaussian naive Bayes classifier.Feature selection was not applied for the RF and XGB models because these types of algorithms, not based on distance like SVM or KNN, are more robust to the high dimensionality.
Moreover, the hyperparameters for RF, XGB, and SVM models were determined using a hyperparameter optimization (HPO) framework.A nested ten-fold CV was applied to the training set to perform the HPO.The HPO was implemented using the Optuna open-source library [62], applying a Bayesian optimization (BO) using a tree-structured Parzen estimator (TPE) as a surrogate model [63].Since the KNN model had a small number of hyperparameters, they were optimized using a grid search.In addition, given the high-class imbalance in the classification problems, the KNN was combined with the synthetic minority over-sampling technique (SMOTE) applied to the minority class [64].
For the regression tasks, the same models adapted for predicting quantitative variables (RF, XGB, GA-SVM, and GA-KNN) were used.In this case, the SE step of the feature selection strategy used in the GA-SVM and GA-KNN was performed using a KNN regressor minimizing the mean squared error (MSE).For the BO-TPE, the MSE evaluated by a nested tenfold CV applied on the training set was minimized.The optimized hyperparameters for each model and the configuration details for BO-TPE and the GAs are provided in Appendix B.
All models were implemented using Python (v3.9.16).The scikit-learn library was utilized for RF, KNN, and SVM algorithms [65].The XGB package [66] was employed for the XGB models.Finally, for the GAs, the home-made library pywin EA2 available on GitHub was used.

Experimental setup
Among the main objectives of this study was to differentiate clinical phenotypes.For this purpose, the following problems were addressed: differentiation of individuals with a preserved cognitive state (SCD) from those with cognitive impairment (MCI and ADD), discrimination between SCD and ADD, classification of MCI and ADD, and the distinction between SCD and MCI.For the prediction of the cognitive composites, models were fitted on each of the five composites mentioned in the "Calculation of cognitive composites" section.
All models were evaluated using a ten-fold CV.Performance metrics were reported as the mean values obtained on the test set.The HPO and feature selection techniques, as described in the "Machine Learning modeling" section, were implemented using a nested CV approach on the training set to prevent overfitting.For classification tasks, CV was conducted with class stratification.Figure 1 illustrated the training and model validation pipeline applied for all the algorithms.
The following performance metrics were considered for the classification problems: where TP and TN represent true positives and negatives and FN and FP stand for false positives and negatives.For the regression problems, the correlation coefficient between model predictions ( Ŷ ∈ R n ) and true values ( Y ∈ R n ) was considered, as well as the following metrics: with Var[ • ] being the variance and |P v 95% − P v 5% | repre- senting the range of the variable v between the 95% and 5% percentiles.Consequently, the RMEA allows contextualizing the magnitude of the MEA in relation to the magnitude of the analyzed variable.
As input variables for all models, the 176 variables extracted from the voice recordings (the "Acquisition and preprocessing of speech data" section) and sociodemographic variables including age, years of education, and sex were considered.For the distance-based models such as SVM and KNN, the variables were standardized to z-scores based on the statistics of the training data.

Cognitive composites analysis
As mentioned in the "Calculation of cognitive composites" section, five cognitive composites were derived from the neuropsychological variables of the NBACE [7].These composites encompassed the cognitive domains of attention, executive functions, language, memory, and visuospatial functions.The SEM used for their estimation showed good fit indices supporting the consistency of the proposed factorial structure (comparative fit index = 0.93, Tucker-Lewis index = 0.92, mean square error of approximation = 0.07) [47].Figure 2 illustrates the composite values based on clinical diagnosis.Using linear regression models, significant differences in all composite scores across different diagnostic groups were found (Bonferroni adjusted p-value < 0.05) (refer to Appendix C for further details).Across all cognitive domains, as expected, SCD subjects exhibited the highest composite scores, while individuals with MCI displayed intermediate ones, and the ADD group showed the lowest composite values.
Table 2 presents the discriminatory performance achieved using a logistic regression model for each composite.The mean values obtained on the test set from ten repetitions of ten-fold CV are displayed.As expected, all composites exhibited a strong discriminatory ability between SCD and ADD individuals.They also offered a clear, although lower, differentiation of SCD and MCI subjects.Notably, the composites of executive functions, language, memory, and visuospatial functions provided the best discrimination between SCD and ADD individuals (AUC > 0.98).The attention composite showed a lower predictive performance (AUC ≈ 0.95).In the detection of MCI individuals, the values were slightly lower.The executive function composite showed the highest predictive capacity (AUC ≈ 0.93).The visuospatial, language, and memory composite scores also demonstrated good discriminatory ability (AUC > 0.90).The attention composite showed the poorest performance (AUC ≈ 0.84).

Spontaneous speech for differentiating clinical phenotypes
The results of the top-performing models for distinguishing clinical phenotypes are presented in Table 3. Due to significant imbalances between classes, the models with the highest F1-score value were selected.The receiver operating characteristic (ROC) curves for each problem, incorporating all algorithms, are depicted in Fig. 3.For the models that integrated GAs, the selected variables are detailed in the Supplementary material.Appendix E provides a summary of the features that were consistently chosen by the GAs during the CV.
Regarding the algorithms implemented, the treebased models showed the best performance for most of the problems.Specifically, the RF obtained the highest F1 for the differentiation between SCD and MCI/ ADD, the XGB achieved the best results in SCD-ADD and MCI-ADD problems, and the GA-SVM model outperformed the rest of the algorithms for distinguishing SCD and MCIs.The GA-KNN combination consistently demonstrated the poorest performance across all the problems (refer to Appendix D for further details).
When distinguishing between SCD and subjects with cognitive impairment (MCI/ADD), an F1-score of 0.85 was achieved.In this comparison, sensitivity and specificity were close to 0.75.The performance notably improved when differentiating between SCD and ADD, reaching an F1-score of 0.92, with a sensitivity of approximately 0.90 and specificity of around 0.80.In contrast, in all cases involving the identification of MCI subjects,    Finally, we conducted a subanalysis excluding sociodemographic information and fitting the top-performing models based purely on physical-acoustic variables.In this scenario, we observed a decline in performance compared to the previous models.Specifically, for distinguishing SCD from individuals with cognitive impairment, the F1-score dropped to 0.80 (from 0.85).In the discrimination between SCD and ADD, the F1-score decreased to 0.81 (from 0.92).For identifying SCD and MCI, the F1-score was similar (0.78 vs 0.77).Finally, in the classification of MCI and ADD, the F1-score moved to 0.55 (from 0.63).

Spontaneous speech for predicting cognitive domains
The predictions given by the best regression models used for estimating the five cognitive scores outlined in the "Calculation of cognitive composites" section, are collected in Fig. 4. Alternatively, the different regression metrics described in the "Experimental setup" section are presented in Table 4.The values of the former table were stratified by clinical phenotype.
Overall, the tree-based models exhibited superior performance.The RF model achieved the lowest MAE in predicting the attention score, while the XGB outperformed the other algorithms for predicting the remaining cognitive domains.Detailed results of all the models can be found in Appendix D. Furthermore, the variables selected by the GAs for the SVM and KNN models are listed in the Supplementary material and Appendix E.
A consistent correspondence between the model predictions and the actual values for each cognitive domain was observed.The language composite regression exhibited the best predictive performance, with a correlation coefficient of 0.57, an EV of 32.3%, and an RMAE of 17.8%.Similarly, for the executive and visuospatial functions, comparable results were achieved, with correlations above 0.56, EVs greater than 31.0%and RMAEs below 18.0%.The models also performed well for the memory and attention composites, although the correspondence between the predictions and the actual values decreased slightly (correlation < 0.5).When stratified by clinical diagnosis, the models performed better, with a fewer MAE, in subjects with MCI across all cognitive domains.The predictions for the SCD group showed the highest errors in attention, executive function, and memory composites.In contrast, the ADD group exhibited the highest error in the composites of language and visuospatial functions.Figure 5 illustrates the prediction distributions generated by the models stratified by the diagnostic group.The regression models predicted higher values in all the composites for the SCD, lower values for the MCIs, and a reduced estimate for the ADD individuals.

Table 4 Regression metrics obtained by the best models in the prediction of each cognitive domain
For each cognitive domain, the regression metrics obtained by the best models are presented.The random forest (RF) was the best model at predicting the attention score, while the extreme gradient boosting (XGB) performed better on all other scores.The metrics were calculated for the entire sample and stratified by clinical phenotype Abbreviations: MAE, mean absolute error; RMAE, relative MAE (described in the "Experimental setup" section); EV, explained variance; SCD, subjective cognitive decline; MCI, mild cognitive impairment; ADD, Alzheimer's disease dementia a The EV is not shown when the variance of the true values was lower than the variance of the residuals Similar to the classification problems (see the "Spontaneous speech for differentiating clinical phenotypes" section), we investigated the influence of demographic variables on the best models.For this purpose, we eliminated the variables of age, years of formal education, and sex from the models.In this context, we observed a marginal decline in their performance, while still maintaining correlations between predicted and true values that closely mirrored those obtained when demographic information was integrated.Specifically, for the attention composite, the correlation and EV were at 0.43 and 18.3%, respectively.Concerning executive function, the correlation and EV values shifted to 0.51 and 26.0%.The performance scores for the memory composite declined to 0.42 and 18.0%.For the language composite, the new values were 0.50 and 25.3%.Finally, for the visuospatial function, the correlation and EV dropped to 0.51 and 26.3% respectively.

Discussion
The present study shows that the automated analysis of a brief SS test using ML techniques consistently detects cognitive alterations along the AD spectrum.To the best of our knowledge, this is the first study with a large sample from a clinical setting exploring the association between SS and cognitive performance across neuropsychological domains.
Firstly, several ML models were applied to distinguish clinical phenotypes associated with AD.Our models focused on differentiating between individuals diagnosed with SCD, MCI, and ADD.The results demonstrated a good differentiation between SCD individuals from those with already manifest cognitive impairment (MCI/ ADD) (AUC = 0.84) as well as between SCD and ADD patients (AUC = 0.93) (Table 3).These findings are consistent with previous studies [17][18][19][20][21][22][67][68][69][70][71].For example, in [20], they achieved an accuracy of 80.28% on the test set for detecting subjects with dementia using information derived from The Cookie Theft Picture description task.In the same line, in [17], they reached an AUC of 0.86 for discriminating between HC and ADD subjects.Similarly, the authors of [21] obtained an AUC of 0.93 for distinguishing HC individuals from those with dementia and an AUC of 0.88 for differentiating dementia and nondementia subjects.Therefore, our findings provide new evidence supporting the potential of SS for the identification of individuals with ADD.
In contrast, the models exhibited a moderate predictive performance for identifying individuals with MCI (MCI vs SCD: AUC = 0.80, MCI vs ADD: AUC = 0.73).These findings are consistent with the results previously reported in the literature [18,21,[23][24][25].In their study, [18] identified MCI subjects with a BA of 0.65, slightly lower than the results presented in this work (BA = 0.69, see Table 3).With subtly better performance, in [23], they used linguistic features extracted from transcripts, reporting a specificity and sensitivity of 78% and 74%, respectively.However, it is important to note that their study was supported by a considerably smaller sample size than ours, limiting conclusions about the model's accuracy.Consistent with the findings of this work, in [21], they observed that the identification of subjects with MCI was notably more challenging compared to detecting dementia, obtaining an AUC of 0.74.Overall, the difficulties in identifying MCI stem from its inherent heterogeneity, encompassing different subtypes and potential underlying etiologies.Furthermore, the cognitive deficits in this population are subtle and can be influenced by factors such as age, education level, and individual cognitive abilities, often masking the underlying MCI status.Consequently, the boundary between normal cognition and MCI remains inherently fuzzy, limiting the performance of the models [42].Future studies should strive to overcome these limitations and enhance the ability to identify the cognitive changes that emerge during the MCI stage.Nevertheless, our results, based exclusively on the physical-acoustic properties of the sound, demonstrate that the application of Artificial Intelligence (AI) techniques to SS data offers valuable insights for identifying patients with MCI.
This study also investigated the potential to estimate cognitive performance using the paralinguistic variables derived from SS tests.For this purpose, neuropsychological tests from a standardized battery of neuropsychological measures were grouped into five composite scores representing the cognitive domains of attention, executive functions, language, memory, and visuospatial functions [7].These neurocognitive composites were the target variables of the ML models.As outlined in the "Cognitive composites analysis" section, the resulting composites discriminate between the different diagnostic groups.This discriminatory ability was expected, as the neuropsychological tests forming the composites are partially instrumental in defining the diagnoses.Nevertheless, these results show that their grouping into cognitive domains is consistent and effectively summarizes the information from the individual neuropsychological tests into higher cognitive functions.Therefore, this subanalysis supports the use of these neuropsychological groupings to describe the different cognitive functions analyzed.
The algorithms used to infer the neurocognitive composites based on the SS tests and physical-acoustic features showed a strong predictive ability (Fig. 4).In general, we observed that the models were proficient in predicting the different cognitive scores with a correlation close to 0.5 relative to the actual values (Table 4).Nevertheless, our findings indicated a significant increase in model errors within the SCD and ADD groups.For example, the correlation between model predictions and actual values in SCD subjects remained poor, likely due to the ceiling effect present in many of the neuropsychological tests used to construct the composites or an overrepresentation of MCI subjects within the sample.However, despite this, we observed a meaningful alignment of the model predictions with the different disease stages, associating higher values for SCD subjects, declining scores in the MCI stage, and lower estimates for ADD patients (Fig. 5).This result becomes especially relevant for the remote detection of cognitive impairment in the general population, as the SS test can be completed within an average duration of 110 s, and the correspondence between model predictions and disease stages is consistent.In addition, the increased predictive ability obtained for MCI patients (see Table 4) is particularly significant, as this stage holds great importance for developing screening tools, recruiting patients for clinical trials, and monitoring disease progression [2].
Regarding the models used, our results also provided insightful findings.First, we noted that tree-based algorithms, specifically RF and XGB, consistently surpassed distance-based algorithms (i.e., SVM and KNN) across tasks.We hypothesize that this superiority can be attributed to the high dimensionality of the input data and the robust nature of tree-based algorithms in handling non-smooth distributions [66,72].Moreover, despite incorporating a prior feature selection step to mitigate the sensitivity of distance-based algorithms to the high dimensionality, the GAs consistently selected a high number of variables.This elevated number of input features is presumably responsible for the poorer performance achieved by these models.It suggests that the effectiveness of feature selection techniques may not scale optimally with the increasing number of input variables [60].
On the other hand, we observed that excluding demographic variables from the models harmed their performance.These findings underscore the potential advantages of incorporating contextual information about the patient in SS and AI-based screening tools, enabling the algorithms to uncover nonlinear interactions among input variables [19,21].For instance, it is reasonable to expect that a specific response pattern on an SS test has distinct implications for an older non-literate individual compared to a young person with a high level of education.However, additional investigation is needed to evaluate the standalone predictive capacity of SS and the consequences of integrating contextual information into the models.In this regard, special attention should be directed towards variables that can be readily and efficiently collected through a remotely administered protocol, such as self-reported information on comorbidities or a family history of neurodegenerative diseases.
Our study shows that an automated analysis of speech based on paralinguistic features and ML techniques has the potential for detecting and assessing AD stages.Compared to other modalities, such as neuroimaging or plasma biomarkers, SS-based protocols represent a fast, cost-effective, and accessible tool for evaluating the patient's cognitive status despite lower diagnostic accuracy [10,73].The SS protocol is noninvasive, does not require expensive equipment or highly trained personnel, and is a patient-friendly procedure.Furthermore, beyond the implementation of SS as an early screening tool, periodic assessment of SS could provide valuable insights into the decline in different cognitive domains on a simple and rapid basis.Overall, it opens up new opportunities for implementing novel and widely accessible to the general population screening strategies.
From a methodological standpoint, our study benefits from using standardized SS tests and paralinguistic features, facilitating the replication of our findings.Moreover, unlike previous studies, we developed models capable of inferring cognitive performance from SS data using a large sample of subjects at different disease stages.In addition, we employed optimized and transparently evaluated ML models, providing detailed fit indices that enable easy comparison with other studies.Collectively, our work presents a promising avenue for leveraging automated speech analysis in AD, offering potential benefits for the early detection and monitoring of cognitive decline.
Nevertheless, this study has certain limitations.Firstly, despite using a large sample of subjects, especially compared with most current studies [17,19,20,23,68,71], the generalization of our results should be confirmed by future prospective analyses involving larger samples and individuals from different cohorts and in different languages.This recurrent issue is commonly encountered in studies employing ML techniques and represents a significant challenge for potential translation to the clinical setting.Secondy, our study was conducted using data generated in a controlled clinical environment, ensuring the acquisition of high-quality audio data.However, the remote administration of the SS protocol is expected to introduce noise and other factors that may potentially impact the performance of the models.Consequently, the evaluation of the effectiveness of the proposed approach with remotely generated data will be an aspect of interest to be explored in future studies.Moreover, our study relies solely on a restricted set of paralinguistic features.As demonstrated by other researchers [18,19,21], using more diverse data derived from speech analysis, such as employing NLP techniques [13], could substantially enhance the performance of predictive models.For future investigations, it may be worthwhile to consider including broader and more diverse SS parameters to improve the performance established in this study.Finally, this work has a cross-sectional design due to the restricted access to follow-up information.Future research endeavors will be necessary to explore how longitudinal analysis of SS data can provide relevant insights into predicting aspects related to the AD continuum, and differentiate ADD from other types of dementia.

Conclusion
In conclusion, the convergence of Artificial Intelligence advancements and the rapid digitization witnessed in recent years has set the stage for developing new technologies capable of monitoring AD in a simple and increasingly accessible manner.Among these innovative technologies, SS protocols, such as the one employed in this study, stand out.These technical breakthroughs hold great potential in enabling early and precise detection of cognitive changes within the AD continuum, ultimately facilitating remote access to specialists and personalized therapies.Our study provides new evidence to the field, demonstrating the feasibility of inferring cognitively impaired performance across different cognitive functions from SS data, establishing a solid foundation for future developments in predictive models.

Appendix A. Calculation of cognitive composites
Machine Learning (ML) models were developed for the prediction of neuropsychological composites.This appendix details the calculation of the composite scores generated from the Neuropsychological Battery from Fundació Ace (NBACE) battery [7].
As mentioned in the main manuscript, the memory composite was created considering the variables longterm and recognition memory of the Word List subtest from the Wechsler Memory Scale, third version (WMS-III) [51].The attention composite included the Digit Forward and Digit Backwards from the Wechsler Adult Intelligence Scale, Third Edition (WAIS-III) [52].To define the visuospatial functions, the 15-Objects Test [53], the Poppelreuter-type overlap figures [54], and the Luria's Clock test [55] were considered.The executive functions were calculated from the Phonetic and Semantic Verbal fluencies [56,57] and the Automatic Inhibition subtest of the Syndrom Kurtz Test (SKT) [58].Finally, language function included the abbreviated 15-item naming test from the Boston Naming Test (BNT) [59] and the Verbal Comprehension and Repetitions [7].The subtest measuring the time taken to complete the SKT was transformed to a logarithmic scale for structural equation models (SEM) fitting.
All available baseline observations collected in Ace were used for the composite scores calculation.Healthy controls (HC) and individuals with subjective cognitive decline (SCD) with a diagnosis of preserved cognition (CDR = 0), patients with mild cognitive impairment (MCI) (CDR = 0.5), and subjects with dementia due to Alzheimer's disease (ADD) (CDR ≥ 1) were selected.Table 5 lists the demographic characteristics of the subjects used to calculate the composite scores.The syntax of the structural equation model used to calculate the composite scores is given in Listing 1.The model fitted with the entire Ace database was subsequently used to infer the composite scores for the sample of subjects used for the study.

Appendix B. Machine Learning model hyperparameters
This appendix details the hyperparameters of all the ML models used in the main manuscript.Table 6 lists the hyperparameters of the algorithms used for the classification and regression problems.The library with the implementation of the genetic algorithm (GA) used is available on GitHub.Table 7 summarizes the hyperparameters of the GA.The Scikit-learn [65] library was used for models: random forest, support vector machines, and k-nearest neighbors.For the XGBoost algorithm, the xgboo st [66] library was employed.The hyperparameter optimization (HPO) was carried out using the optuna [62] library.The definition of the hyperparameters listed in the table can be found in the documentation of the respective libraries.For the hyperparameter optimization conducted using tree-structured Parzen estimator (TPE), a total of 1000 configurations were sampled, with the initial 500 being randomly selected a Hyperparameters not listed in this table were selected at their default value b Hyperparameters that were left fixed are specified by "-"

Appendix C. Cognitive composites analysis
This appendix collects the results of the regression models used to analyze the composites generated in the studied sample.Table 8 contains the values represented in Fig. 2 of the main manuscript.Coefficients of the regression models considering the composite value as the dependent variable and the dummy coding of the clinical diagnosis as the independent variable.The models were adjusted for sex, age, and educational level.The coefficients indicate the average value of the composite associated with each group adjusting for the covariates

Appendix D. Machine Learning model results
This appendix summarizes the results obtained by the classification (Table 9) and regression models (Table 10) applied for the differentiation of clinical phenotypes and the prediction of composite cognitive scores, respectively.For each model, the average results obtained on the test set of a ten-fold cross-validation are shown.

Appendix E. Features selected by the genetic algorithms
This appendix collects the features selected by the models that underwent a preliminary feature selection step using GAs (i.e., SVM and KNN).Table 11 shows the percentage of times in which each variable was selected in each fold for each problem.To simplify the information, only variables consistently chosen across different folds are included, encompassing those that appeared more than 75% of the time in each of the problems examined in the study.The total number of features selected in each of the folds by each of the algorithms in each problem is provided in a separate Excel document (see Selected-features-GA.xlsx).
Table 11 Percentage of times each feature was selected across cross-validation folds for the different problems by the genetic algorithms

Fig. 1
Fig.1Pipeline applied to train and evaluate the different models used in this study.The input data were divided into ten folds according to a cross-validation (CV) scheme.Feature selection and hyperparameter optimization (HPO) were conducted using a nested CV applied to the training data.The resulting model from this nested CV was then used to make the predictions on the test set.The final performance metrics were calculated based on the average performance of the predictions obtained on the test set

Fig. 2
Fig. 2 Cognitive composite values were obtained through structural equation models (SEM) and adjusted for age, sex, and years of education.The scores were represented based on the diagnostic group.Abbreviations: SCD, subjective cognitive decline; MCI, mild cognitive impairment; ADD, Alzheimer's disease dementia

Fig. 3
Fig.3 The average values of the receiver operating characteristic (ROC) curves, obtained for the models discussed in the "Machine learning modeling" section, are presented.The ROC curves were calculated on the test set from a ten-fold cross-validation.The analyzed classification tasks encompassed: A discrimination between subjective cognitive decline (SCD) and patients with mild cognitive impairment (MCI) or Alzheimer's disease dementia ADD, B SCD vs ADD, C SCD vs MCI, D and MCI vs ADD.Abbreviations: RF, random forest; XGB, extreme gradient boosting; GA, genetic algorithm; SVM, support vector machine; and KNN, k-nearest neighbor

Fig. 4 Fig. 5
Fig. 4 Correlation between the predicted values generated by the models and the actual values for each cognitive domain under examination.The predictions from the model with the lowest mean absolute error (MAE) were represented.The values depicted in the figure correspond to the predictions made on the test set from a ten-fold cross-validation.To minimize the influence of outliers in the representation, the scales of both the X and Y axes were adjusted considering the 95th percentile of the true values.Abbreviations: ρ , correlation between model predictions ( Ŷ ) and true values (Y); R 2 , coefficient of determination; RF, random forest; XGB, extreme gradient boosting

Table 1
Clinical and sociodemographic variables of the sample used for this study

Table 2
Discriminatory capacity of cognitive composites for differentiating clinical phenotypes using subjects with subjective cognitive decline (SCD) as referenceMean value obtained over the test set from ten repetitions of tenfold cross-validation.A logistic regression model was employed to predict clinical phenotypes.The composite scores were considered as independent variables Abbreviations: AUC , area under the curve; MCI, mild impairment; ADD, Alzheimer's disease dementia the performance decreased.When discerning between MCI and SCD, a specificity of 0.77 was reached, but at the expense of a high rate of false positives (sensitivity of 0.62).Moreover, the specificity for distinguishing ADD from MCI decreased to 0.71 while maintaining a low sensitivity (< 0.65).

Table 3
Average values of the classification metrics computed on the test set for the models achieving the highest F1-score in each of the problems Abbreviations: BA, balanced accuracy; F1, f1-score; RF, random forest; XGB, extreme gradient boosting; GA-SVM, genetic algorithm-support vector machine; SCD, subjective cognitive decline; MCI, mild cognitive impairment; ADD, Alzheimer's disease dementia

Table 5
Clinical and sociodemographic characteristics of the sample used to estimate the parameters of the structural equation models used to calculate the composite scores

Table 6
Hyperparameters of classification and regression models

Table 7
Hyperparameters of the genetic algorithm used to perform feature selection

Table 8
Regression model results for the study sample, controlling for sex, age, and years of formal education Abbreviations: CDR, Clinical Dementia Rating; CI, confidence interval a

Table 9
Average values of the classification metrics computed on the test set for the binary classification problems Abbreviations: BA, balanced accuracy; F1, f1-score; RF, random forest; XGB, extreme gradient boosting; GA-SVM, genetic algorithm-support vector machine; GA-KNN, genetic algorithm-K-nearest neighbors

Table 10
Regression metrics obtained in the prediction of the composite cognitive scores Abbreviations: MAE, mean absolute error; RMAE, relative MAE (described in the "Experimental setup" section); EV, explained variance; RF, random forest; XGB, extreme gradient boosting; GA-SVM, genetic algorithm-support vector machine; GA-KNN, genetic algorithm-K-nearest